A practical algorithm is described that allows an LR parser to
parse past the point at which an error was detected. By thus
parsing, context beyond the point of error detection is gathered.
We prove several important properties about this "forward context"
and demonstrate its usefulness in the selection and evaluation of
error repairs. At first specifically restricting our consideration
to single occurrences of errors of insertion, deletion, or
replacement of a single terminal symbol, we show how to use the
algorithm and suggest possible error repair strategies. Then we
suggest a generalization to encompass recovery from any number and
type of error.

Our  work  is  related  to  the  similar  work  of  Graham 
and  Rhodes  for  simple  precedence  parsers.  We  not  only 
extend  their  concept  to  LR  parsers  but  derive  properties 
about  forward  context  that  can  significantly  assist  an 
error  repair  strategy. 
Chapter 1.

INTRODUCTION 

Graham and Rhodes [G&R 75] have proposed an error recovery scheme
for bottom-up deterministic parsers that involves "condensing"
context about the point at which an error was detected. A "backward
move" condenses the context to the left of the error point, and a
"forward move" gathers context to the right of the error point.
Such context is valuable input to an error repair strategy.

In  their  paper  they  show  how  the  condensation  is  done 
for  simple  precedence  parsers,  and  give  an  error  repair 
strategy  that  uses  the  condensed  context. 

We investigate the condensation problem for LR parsers (by which
we mean to include LR(k) and all its variants: SLR(k), LALR(k),
etc.). We give a practical algorithm that allows an LR parser to
perform the forward move, prove several properties about the
algorithm relevant to error repair, and suggest ways that the
"forward context" may be used in an error repair strategy. We do
not treat the backward move since we are not convinced of its
usefulness in LR error recovery.

Chapter 2 introduces terminology, both standard and nonstandard,
to describe the concepts involved in LR parsing. Chapter 3 gives a
preliminary version of the forward move algorithm. The algorithm
works by carrying along in parallel all possible parses of the
input text following the error point, halting when the parses do
not agree as to the next move the parser should make, when the
parser must make reference to the context to the left of the error
point in order to proceed, or when another error occurs. The
halting conditions give the algorithm important properties that can
substantially assist an error repair strategy in the selection and
evaluation of repairs. These properties we prove in Chapter 4. The
most important is that the forward context produced by the forward
move algorithm can be used to efficiently verify that a repair
attempt is in a sense "consistent" with the input text consumed by
the forward move.

In Chapter 5 we give a framework for error recovery:
error recovery algorithm = forward move + error repair strategy.
We limit ourselves initially to a few (but the most common) types
of errors: insertion, deletion, or replacement of a single terminal
symbol. For these we show how to use the forward move algorithm to
gather forward context. We suggest ways that the forward context
may be used to assist an error repair strategy, based upon the
properties proved in Chapter 4.

Finally we convert the algorithm in Chapter 3 to an equivalent but
practical algorithm. The algorithm in Chapter 3 explicitly carries
along the parallel parses; in Chapter 6 we recode the algorithm in
terms of additional states and transitions between them, in
essentially the same way a nondeterministic finite-state machine is
converted to a deterministic finite-state machine. The recoded
algorithm carries the parallel parses implicitly, and is about as
efficient as the LR parsing algorithm.

Chapter  7 summarizes  and  lists  further  areas  of 
research. 

Druseikis  and  Ripley  [D&R  76]  have  solved  the  forward 
move  problem  for  SLR  parsers;  we  contrast  our  technique 
to  theirs. 


Chapter 2.

DEFINITIONS AND TERMINOLOGY

[...] and their construction. We establish terminology for them,
both standard and nonstandard [...] "LR" we mean to include [...]
LALR(k), etc. Those unfamiliar with LR parsers should consult [...]

A context free grammar (CFG) is a quadruple [...] terminals,
nonterminals, start symbol, and productions [...] otherwise specify,
adhere to the following conventions [...] for Latin letters [...]
closure. Productions are elements of this relation. Thus
$w_1 \to w_2$ iff for some $A \in N$, $v \in T^*$, $w, y \in V^*$,
$w_1 = yAv$ and $w_2 = ywv$ and $A \to w \in P$.

This is the rightmost derivation; for the purpose of LR parsing we
are not interested in any other definition of derivation. Further,
we assume that the grammar contains a production of the form
$S \to S'\bot$, where $S$ and $\bot$ appear in no other production,
$S' \in N$, and $\bot \in T$. A (rightmost) sentential form of G is
a string $y \in V^*$ such that $S \to^* y$. A sentence of G is a
sentential form consisting entirely of terminals.

Associate with each production $A \to w \in P$ a special symbol
$\#_{A \to w}$ not in $V$. If, for some $A \in N$, $y, w \in V^*$
and $v \in T^*$, $S \to^* yAv \to ywv$, we define $yw\#_{A \to w}$
to be the characteristic string of the sentential form $ywv$, and
any prefix of $yw$ is called a valid prefix of G.

Each sentential form of an unambiguous grammar has a unique
characteristic string, and the set of all characteristic strings of
a grammar is a regular set. A characteristic finite-state machine
(CFSM) of G is a deterministic finite-state machine that recognizes
the characteristic strings of G [DeR 69].
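As a small worked illustration of these definitions, consider a toy grammar with productions $S \to E\bot$, $E \to (E)$ and $E \to i$. This grammar is an assumption made for the example and is not one of the paper's figures. The sketch below traces one rightmost derivation and reads off the characteristic string and the valid prefixes it yields.

```latex
% Toy grammar, assumed for illustration only:
%   S -> E \bot,   E -> ( E ),   E -> i
\begin{align*}
  &S \to E\,\bot \to (\,E\,)\,\bot \to (\,i\,)\,\bot
    && \text{(rightmost derivation)} \\
  &y = (\,,\quad A = E,\quad w = i,\quad v = )\,\bot
    && \text{(last step } yAv \to ywv\text{)} \\
  &yw\,\#_{E \to i} \;=\; (\;i\;\#_{E \to i}
    && \text{(characteristic string of } (\,i\,)\,\bot\text{)} \\
  &\varepsilon,\quad (\,,\quad (\,i
    && \text{(valid prefixes: the prefixes of } yw\text{)}
\end{align*}
```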

A finite-state machine (FSM) is a 5-tuple (K, START, SIGMA, V, F)
where K is a finite set of states, START $\in K$ is the start
state, $F \subseteq K$ is the set of final states, V the
vocabulary, and SIGMA the transition function mapping $K \times V$
into $K$. Let $G = (N,T,S,P)$. A CFSM of G is the FSM
(K, START, SIGMA, V', F) where
$V' = N \cup T \cup \{\#_p \mid p \in P\}$
and the states of K are sets of items, marked productions of the
form $A \to x.y$ (the dot is the marker) where $A \to xy \in P$.
START contains the item $S \to .S'\bot$, among others. Each
nonempty state q in K has one or more successors under SIGMA. START
has the successor state $\{S \to S'.\bot\}$, among others. In
general, a state q has an s-successor for each symbol s in
$N \cup T$ that is preceded by the marker dot in one of q's items.
If q contains an item $A \to w.$ with a marker to the right of all
symbols in the right part of the production (such an item is called
a final item), q has a $\#_{A \to w}$-successor that is the empty
set, which is the only final state (i.e. $F = \{\{\}\}$). The
s-successor of q is called a terminal read successor if $s \in T$,
nonterminal read successor if $s \in N$, or reduce successor if
$s \in \{\#_p \mid p \in P\}$. The reader should consult [DeR 69]
for the details of the computation of K. We express the fact that
SIGMA(q,s) = q' by the transition $q \xrightarrow{s} q'$. All
nonempty states have a unique accessing symbol defined as follows:
if a state q is the s-successor of a state q', then the accessing
symbol of q is s. This definition does not cover the state START,
to which we assign the accessing symbol $\bot$.

A CFSM state having only read successors is called a read state.
Any state having one reduce successor and zero or one nonterminal
read successors is called a reduce state. States having two or more
reduce successors or having one or more reduce successors and one
or more terminal read successors are called inadequate states. All
states in K are covered by these three definitions except the final
state {}.

A path of the CFSM is a sequence of states $q_0, q_1, \ldots, q_n$
such that there exist transitions
$q_0 \xrightarrow{w_1} q_1$, $q_1 \xrightarrow{w_2} q_2$, ...,
$q_{n-1} \xrightarrow{w_n} q_n$ in the CFSM, and
$w = w_1 w_2 \ldots w_n$ is the string spelled out by the path.
$w \in V'^*$ describes a path from $q_0$ to $q_n$ in the CFSM iff
there exists a path $q_0, \ldots, q_n$ and the path spells out w.
For brevity we say "$q_0$ gets to $q_n$ by w". For any path P,
Top P indicates the last state in the sequence, i.e. if
$P = q_0, q_1, \ldots, q_n$ then Top P = $q_n$. If $q_0$ gets to
$q_n$ by w, then $[q_0:w]$ is the sequence of states
$q_0, q_1, \ldots, q_n$ that is the path from $q_0$ that spells out
w (in a CFSM this path is unique). w accesses q if START gets to q
by w. We abbreviate [START:w] by [w]. The concatenation of two
paths [q:y] and [q':y'], where Top [q:y] = q', is written
[q:y][q':y'] and designates [q:yy'] (that is, we do not repeat the
state q' in the concatenation of the paths).
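The path notation can be sketched in code. The following is a minimal illustration, assuming a hand-built CFSM for a toy grammar S -> E $, E -> ( E ), E -> i (with "$" standing in for the end marker); the state numbers, the table `SIGMA`, and the helper names `path`, `top`, `concat` and `accesses` are hypothetical and are not taken from the paper's figures.

```python
# Hypothetical toy CFSM (read transitions only), assumed for illustration:
# grammar  S -> E $,  E -> ( E ),  E -> i
START = 0
SIGMA = {
    (0, "E"): 1, (0, "("): 2, (0, "i"): 3,
    (2, "E"): 4, (2, "("): 2, (2, "i"): 3,
    (4, ")"): 5,
}


def path(q, w):
    """Return [q:w], the unique state sequence from q that spells w, or None."""
    states = [q]
    for symbol in w:
        nxt = SIGMA.get((states[-1], symbol))
        if nxt is None:          # q does not get anywhere by w
            return None
        states.append(nxt)
    return states


def top(p):
    """Top P: the last state of path P."""
    return p[-1]


def concat(p1, p2):
    """[q:y][q':y'] with Top[q:y] = q'; the shared state q' is not repeated."""
    if top(p1) != p2[0]:
        raise ValueError("paths do not meet")
    return p1 + p2[1:]


def accesses(w):
    """Return the state accessed by w (i.e. Top [w]), or None."""
    p = path(START, w)
    return None if p is None else top(p)
```

Under these assumptions, `path(0, ["(", "E", ")"])` is the path `[0, 2, 4, 5]`, and concatenating `[0:"("]` with `[2:"E )"]` gives the same sequence.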



For parsers with 1-symbol look-ahead a look-ahead set of terminal
symbols is attached to each final item in the states of the CFSM.
(Computation of the look-ahead sets may or may not affect the
construction of the CFSM.) We use function $LA(q, A \to w)$ to
represent the look-ahead set for final item $A \to w.$ in state q.
The LR parser for G is the CFSM of G plus a parser decision
function PD mapping $K \times V$ into subsets of
$P \cup \{\text{read}\} \cup \{\text{accept}\}$:

$$PD(q,s) = \{\text{read} \mid q \xrightarrow{s} q' \text{ and } s \in T - \{\bot\}\} \;\cup\; \{A \to w \mid q \xrightarrow{\#_{A \to w}} q' \text{ and } s \in LA(q, A \to w)\} \;\cup\; \{\text{accept} \mid q = \{S \to S'.\bot\} \text{ and } s = \bot\}.$$

The grammar G is LR iff $|PD(q,s)| \le 1$ for all $q \in K$,
$s \in V$. Equivalently, for each inadequate state, the 1-symbol
look-ahead sets for final items are disjoint, and if the state has
an s-successor, then s is in no look-ahead set.

For later reference we present the LR parsing algorithm, which
uses the CFSM, PD, and a pushdown store called the state stack. By
"reading a symbol" we mean that the parser strips the input text of
its first terminal symbol, exposing the next symbol to be read. We
assume that the last symbol, and only the last symbol, of the input
is ⊥. Parsing is accomplished by the following:

LR parsing algorithm (LRPA).

   Push START on the (empty) state stack.
   Repeatedly parse according to the following:
      Let h = head of input, q = state on top of state stack.
      do case PD(q,h):
         case {read}:    Read the symbol h and
                         push SIGMA(q,h) on the stack.
         case {A → w}:   Pop |w| items off the stack.
                         Let q be the new top of stack.
                         Push SIGMA(q,A) on the stack.
         case {}:        Halt, signalling an error
                         and rejecting the input.
         case {accept}:  Halt, accepting the input.
         case otherwise (i.e. |PD(q,h)| > 1):
                         Halt, confused; the parser cannot decide
                         between the actions presented it. If G
                         is LR, this step will never be
                         encountered.
end LRPA
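LRPA can be sketched as follows. This is a minimal sketch under stated assumptions, not the report's implementation: the tables `SIGMA`, `REDUCE` and `LA` are hand-built for a hypothetical toy grammar S -> E $, E -> ( E ), E -> i (with "$" standing in for the end marker), and the function names `pd` and `lrpa` are invented for the example.

```python
# Hypothetical tables for the toy grammar  S -> E $,  E -> ( E ),  E -> i.
END = "$"                                   # stands in for the end marker
SIGMA = {                                   # read transitions of the CFSM
    (0, "E"): 1, (0, "("): 2, (0, "i"): 3,
    (2, "E"): 4, (2, "("): 2, (2, "i"): 3,
    (4, ")"): 5,
}
REDUCE = {3: ("E", ("i",)), 5: ("E", ("(", "E", ")"))}   # final items
LA = {3: {")", END}, 5: {")", END}}                       # look-ahead sets
ACCEPT_STATE = 1                            # the state {S -> E . $}


def pd(q, h):
    """Parser decision function PD(q, h): the set of moves allowed."""
    moves = set()
    if h != END and (q, h) in SIGMA:
        moves.add("read")
    if q in REDUCE and h in LA[q]:
        moves.add(REDUCE[q])
    if q == ACCEPT_STATE and h == END:
        moves.add("accept")
    return moves


def lrpa(tokens):
    """Run LRPA; return (outcome, state stack, index of next unread token)."""
    stack, pos = [0], 0                     # push START on the empty stack
    while True:
        q, h = stack[-1], tokens[pos]
        moves = pd(q, h)
        if not moves:                       # case {}
            return "error", stack, pos
        if len(moves) > 1:                  # case otherwise; never for LR G
            return "confused", stack, pos
        move = next(iter(moves))
        if move == "accept":
            return "accept", stack, pos
        if move == "read":
            stack.append(SIGMA[(q, h)])
            pos += 1
        else:                               # case {A -> w}
            lhs, rhs = move
            del stack[len(stack) - len(rhs):]
            stack.append(SIGMA[(stack[-1], lhs)])
```

On the input `( i ) $` the sketch accepts; on `( ) $` it halts in case {} with only the states for START and "(" on the stack, which is the kind of configuration from which a forward move would be attempted.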

We refer to a configuration of the parser as a pair (Z,R) where Z
is the state stack and R is the remaining (unread) portion of the
input. Thus the parser starts out in the configuration (START,R)
where R is the input. The parser makes transitions from one
configuration to another via moves, members of
$P \cup \{\text{read}\} \cup \{\text{accept}\}$.
PD maps $K \times V$ into a set of moves. We use $\vdash$ to
indicate the parser's transitions from one configuration to
another, and $\vdash^{*}$ and $\vdash^{+}$ as the
reflexive-transitive and transitive closure of $\vdash$,
respectively. Thus case {read} of LRPA can be stated as
$(Zq, hR) \vdash (Zqq', R)$ where q' = SIGMA(q,h), and case
$\{A \to w\}$ as
$(Zq q_1 q_2 \ldots q_{|w|}, hR) \vdash (Zq q_A, hR)$ where
$PD(q_{|w|}, h) = \{A \to w\}$ and $q_A$ = SIGMA(q,A). The parser
accepts iff $(\text{START}, R) \vdash^{*} ([S'], \bot)$; we use the
synonym accept for $([S'], \bot)$. We define the relation reduces
to as follows: (Z,hR) reduces to (Z',hR) iff
$(Z,hR) \vdash^{*} (Z',hR)$ and PD(Top Z',h) is either {read} or
{accept}, i.e. all possible reductions on Z with h as the next
symbol of input have been carried out, and the parser is prepared
to read or accept.

(Many LR parser implementations do not attach look-ahead sets to
final items in reduce states, but only to final items in inadequate
states. This allows somewhat smaller parse tables, a slightly
faster parser, and perhaps less look-ahead set computation time. We
regret that the forward move algorithm precludes the use of this
efficiency technique. However, the payoff is earlier detection of
errors and better error recovery than when the efficiency technique
is employed.)
Chapter 3.

FORWARD  MOVE  ALGORITHM 

When  an  error  occurs  during  parsing  (case  { } of  LRPA) , 
we  would  like  to  invoke  a mechanism  that  performs  the 
"forward  move"  of  Graham  and  Rhodes,  i.e.  parses  some  of 
the  remaining  input  without  regard  to  the  text  already 
parsed.  In  an  LR  parser,  this  means  that  the  forward 
move  proceeds  without  referencing  the  left  context  already 
developed  on  the  state  stack.  For  example,  the  Algol 
symbol "do" can appear in two contexts: in a "for" or "while"
statement. If "do" is unexpectedly encountered by LRPA, the forward
move would resume parsing without knowing which of these two
contexts the "do" actually appears in (if either). We would parse
ahead as far as we
could  without  referencing  the  context  to  the  left  of  the 
error  point,  halting  when  we  can  no  longer  parse  independent 
of  that  context,  and  ending  up  with  a fragment  of  a 
sentential  form  representing  the  text  we  parsed.  A grammar 
for  an  Algol-like  language  appears  in  Figure  1.  Consider 
the  would-be  program  in  this  language 
begin  integer  X,  J;  J :=  0; 

for  X :=  1 step  1 until  do  begin  J :=  X end 
end. 


where  we  omitted  the  limiting  value  in  the  "for"  statement. 

Upon detecting the error, LRPA's state stack (writing only the
accessing symbols of the states) would appear as

   ⊥ begin Stmt ; Stmt ; for Id := Exp step Exp until

where we have capitalized nonterminals and left terminals
uncapitalized. Now, mark the top of the stack with the symbol ?,
and attempt a forward move. We might read as far as the penultimate
"end", resulting in the new stack

   ⊥ begin Stmt ; Stmt ; for Id := Exp step Exp until ? do Stmt

The forward move halts presumably because the appearance of the
last "end" indicates that we should reduce either with the
production "Stmt $\to$ for Id := Exp step Exp until Exp do Stmt" or
with the production "Stmt $\to$ while Exp do Stmt", and we do not
know which is applicable without looking at the stack to the left
of the ?. Reducing the text "do begin J := X end" to "do Stmt" did
not require reference to the context to the left of ?; no matter
whether a "for" or a "while" appears earlier on the stack,
"do begin J := X end" should always be reduced to "do Stmt". We
call the text read during the forward move the forward text and
that phrase fragment to which the text is reduced the forward
context.

We describe an algorithm that achieves this forward move by
carrying along in parallel all possible parses of the forward text,
as long as all parses agree as to the next move to make, and no
parse refers to context to the left of the error point. For this
algorithm we have not states but sets of states appearing on the
stack. (In Chapter 6 we convert the sets of states to states
themselves and recode the algorithm so that it is practical.) The
algorithm has two initialization steps, followed by repeated parse
steps.

Forward Move Algorithm (FMA)

   Push?:  Push ? = K on an empty stack.
   Readh:  Let h = head of input.
           Push {q' | q --h--> q' and q ∈ ?} on the stack. Read h.
   Parse repeatedly according to the following rules:
      Let h = head of input, Q = state set on top of stack.
      Let PD = union over q ∈ Q of PD(q,h).
      do case PD:
         case {read}:    Read h and push
                         {q' | q --h--> q' and q ∈ Q}.
         case {A → w}:   Perform a reduction:
                         Ensure that there are at least |w| state
                         sets on the stack following the ? (i.e.
                         ensure that the entire right hand side w
                         resides on the top of the stack).
                         If not, halt.
                         Otherwise, pop |w| state sets off the stack.
                         Let Q be the new top of stack.
                         Push {q' | q --A--> q' and q ∈ Q}.
         case {}:        Halt, signalling an error.
         case {accept}:  Halt; we have consumed all but the ⊥.
         case otherwise (i.e. |PD| > 1): Halt.
end FMA
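The steps of FMA can be sketched in the same style. This is a minimal sketch under stated assumptions rather than the practical algorithm of Chapter 6: it uses hand-built tables for a hypothetical toy grammar S -> E $, E -> ( E ), E -> i (with "$" standing in for the end marker), and the names `pd`, `successors`, `fma`, `forward_context` and the outcome strings are invented for the example. As in the text, it assumes the input passed to it does not consist of the end marker alone.

```python
# Hypothetical tables for the toy grammar  S -> E $,  E -> ( E ),  E -> i.
END = "$"
STATES = range(6)
SIGMA = {
    (0, "E"): 1, (0, "("): 2, (0, "i"): 3,
    (2, "E"): 4, (2, "("): 2, (2, "i"): 3,
    (4, ")"): 5,
}
REDUCE = {3: ("E", ("i",)), 5: ("E", ("(", "E", ")"))}
LA = {3: {")", END}, 5: {")", END}}
ACCEPT_STATE = 1
ACCESS = {1: "E", 2: "(", 3: "i", 4: "E", 5: ")"}   # accessing symbols


def pd(q, h):
    """Parser decision function PD(q, h) for a single state."""
    moves = set()
    if h != END and (q, h) in SIGMA:
        moves.add("read")
    if q in REDUCE and h in LA[q]:
        moves.add(REDUCE[q])
    if q == ACCEPT_STATE and h == END:
        moves.add("accept")
    return moves


def successors(Q, s):
    """{q' | q --s--> q' and q in Q}."""
    return frozenset(SIGMA[(q, s)] for q in Q if (q, s) in SIGMA)


def fma(tokens):
    """Forward move over tokens; return (halt reason, stack, next index)."""
    stack = [frozenset(STATES)]                     # Push?: ? = K
    stack.append(successors(stack[0], tokens[0]))   # Readh
    pos = 1
    while True:
        h, Q = tokens[pos], stack[-1]
        moves = set()
        for q in Q:                                 # union of PD(q, h)
            moves |= pd(q, h)
        if not moves:
            return "error", stack, pos              # case {}
        if len(moves) > 1:
            return "otherwise", stack, pos          # parses disagree
        move = moves.pop()
        if move == "accept":
            return "accept", stack, pos
        if move == "read":
            stack.append(successors(Q, h))
            pos += 1
        else:                                       # case {A -> w}
            lhs, rhs = move
            if len(stack) - 1 < len(rhs):           # reduction over the ?
                return "needs-left-context", stack, pos
            del stack[len(stack) - len(rhs):]
            stack.append(successors(stack[-1], lhs))


def forward_context(stack):
    """Accessing symbols of the state sets above the ? (shared per set)."""
    return [ACCESS[next(iter(Q))] for Q in stack[1:]]
```

For the forward text `i ) $` the sketch reduces `i` to `E`, reads `)`, and then halts because the reduction by E -> ( E ) would reach past the ?, leaving the forward context `E )`.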

FMA essentially follows all paths starting at any state in the
CFSM that allow the parsing of the input text, halting (1) when two
different paths end up in states that disagree as to how to
continue the parse (this difference is caught in case "otherwise"
of FMA), (2) when all paths end up in states requiring a reduction
over the ? (case $\{A \to w\}$), (3) when we read the entire input
(case {accept}), or (4) when we encounter another error (case {}),
i.e. no path can be continued.

We illustrate the halts of case $\{A \to w\}$ and case "otherwise"
by Examples 1 and 2 below, where the grammar involved is a simple
arithmetic expression grammar. Figure 2 contains the grammar and
its CFSM augmented with LALR(1) look-ahead sets.
Example 1. Let the erroneous input string be i ( i ) ⊥.
LRPA stops with state stack [i].

The following displays the execution of FMA on the remainder of
the input:

   FMA step       Stack after                 Rest of
   just made      FMA step                    input

   Push?          ?                           ( i ) ⊥
   Readh          ? {(_0}                     i ) ⊥
   {read}         ? {(_0} {i_0}               ) ⊥
   {P → i}        ? {(_0} {P_0}               ) ⊥
   {T → P}        ? {(_0} {T_0}               ) ⊥
   {E → T}        ? {(_0} {E_1}               ) ⊥
   {read}         ? {(_0} {E_1} {)_0}         ⊥
   {P → ( E )}    ? {P_0}                     ⊥
   {T → P}        ? {T_0, T_1, T_2}           ⊥

The algorithm halts here because

$$PD(T_0, \bot) \cup PD(T_1, \bot) \cup PD(T_2, \bot) = \{E \to E + T,\; T \to P ** T,\; E \to T\}.$$

Example 2. Input is ( ) ⊥.
LRPA halts with state stack [(].

   FMA step       Stack             Rest of input

   Push?          ?                 ) ⊥
   Readh          ? {)_0}           ⊥

Halt: $PD()_0, \bot) = \{P \to ( E )\}$, and there are fewer than
three items on the stack above the ?.




In Example 1, we face the possibilities of reducing by three
different productions. $E \to T$ is the proper reduction only if
what immediately precedes the T is a "(" or the start state;
$E \to E + T$ is the proper reduction only if what immediately
precedes the T is "E +"; and $T \to P ** T$ is correct only if
"P **" precedes the T; the ? to its left indicates no knowledge of
what precedes the T. Thus we cannot continue parsing without making
a guess, and must halt. In effect the three different places in the
CFSM in which a T can be read yield three different decisions as to
what to do with the T.

In Example 2, we attempt to reduce with $P \to ( E )$, but find
that "( E" does not precede ")" on the stack. The attempted
reduction gives us an indication of what the user intended,
however, and provides useful information for an error recovery
algorithm, as we shall see later.

The second initialization step Readh of FMA guarantees that the
algorithm produces a forward context of length at least one. If we
did not force FMA to read the first symbol, then it might also
consider reductions that have the first input symbol in their
look-ahead sets; possible choices between a read and some
reductions might have caused FMA to halt immediately in case
"otherwise", making no progress whatsoever. (We assume also for the
remainder of this paper that we never invoke FMA on the input
consisting only of ⊥, otherwise we would immediately read ⊥ in step
Readh.)

FMA computes state sets dynamically; there is no reason why these
state sets and the transitions between them cannot be precomputed,
resulting in an FSM. This is formalized in Chapter 6. Meanwhile, we
can use Chapter 6's results to extend the concepts of transitions
and paths to FMA's state sets. Hence, if FMA consumes forward text
u from string uv and produces forward context U, we may write
$(?,uv) \vdash^{*} ([?:U],v)$. U represents a "condensed" or
"partially parsed" version of u: $U \to^{+} u$ (we may write
$U \to^{+} u$ instead of $U \to^{*} u$ since $|u| \ge 1$). To
prevent confusion between LRPA and FMA, we prefix moves of FMA by
"FMA:", as in FMA: $(?,uv) \vdash^{*} ([?:U],v)$.


Chapter 4.

THE WEAK VALID FRAGMENT PROPERTY AND FMA

[...] string uv from which FMA reads u [...] weak valid [...]
First, we define the "valid fragment property" and then weaken it.
Informally, for some suffix [...] $U \in V^*$ is a "valid fragment"
of uv [...] such that S [...] That is, if S [...] yuv, not only
must u be derived from U [...] then the grammar is ambiguous), but
the derivation step deriving yUv must involve the last symbol of U.
We define this formally in terms of parser actions:

[...] $U \in V^*$ is a valid fragment of uv iff $U \to^{*} u$ and
for every valid prefix y such that ([y],uv) [...]

In other words, any state stack [y] satisfying the conditions of
the definition must cause LRPA to read all
of u and develop the valid fragment U on its state stack, i.e.
reduce u to U.

In the context of error recovery, this concept has the following
significance: Suppose LRPA encounters an error and halts in
configuration (Z,uv), with uv a suffix of a sentence. (We deal with
the case where uv is not a suffix in Chapter 5.) Let us propose
that by substituting [y] for Z we could cause LRPA to accept. How
could we verify this proposition? By running LRPA, to be sure. But
if we had many such strings [y] to try, running LRPA could be
costly. Now, suppose that we had some valid fragment U of uv. A
necessary (not sufficient) condition that
$([y],uv) \vdash^{*}$ accept is that a path starting at Top [y]
spells out U, i.e. there exists some path [Top[y]:U], implying that
(since $U \to^{*} u$)
$([y],uv) \vdash^{+} ([y][\mathrm{Top}[y]:U],v) = ([yU],v)$. Thus,
valid fragments give us a useful tool with which to limit our
selection of [y]'s.

It turns out that since FMA reads as its first step, the forward
context U that it provides does not quite satisfy the valid
fragment property. It is, however, a "weak valid fragment" and can
be used in a testing procedure similar to that described above.
Informally, for some suffix uv of a sentence, $U \in V^*$ is a
"weak valid fragment" of uv iff $U \to^{*} u$ and for every y such
that $S \to^{*} yuv$, there exists $y' \in V^*$ such that

$$S \to^{*} y'Uv \to^{*} y'uv \to^{*} yuv,$$

and y'U is a proper prefix of the characteristic string of y'Uv.
That is, if $S \to^{*} yuv$, not only must u be derived from U, but
there exists a y' such that $y' \to^{*} y$ and the derivation step
producing y'Uv involves the rightmost symbol of U. Formally:

Definition. For some suffix uv of a sentence, $U \in V^*$ is a
weak valid fragment (WVF) of uv iff $U \to^{*} u$ and for every
valid prefix y such that $([y],uv) \vdash^{*}$ accept, there exists
a $y' \in V^*$ such that

$$([y],uv) \vdash^{*} ([y'],uv) \vdash^{+} ([y'U],v).$$

In other words, any state stack [y] that causes LRPA to accept uv
must cause it to reduce [y] to some [y'], read all of u and develop
the weak valid fragment U on its state stack. We shall prove that
the forward context returned by FMA satisfies the WVF property. The
reason for the complication of reducing [y] to [y'] is that FMA
does not consider reducing as its first move.

Suppose now that LRPA encounters an error in configuration (Z,uv),
and that uv is a suffix of a sentence. If we propose that replacing
Z by [y] could cause LRPA to accept, the forward context U of uv
provided by FMA gives us a necessary condition on the validity of
[y] as a replacement. $([y],uv) \vdash^{*}$ accept only if there
exists y' such that $([y],uv) \vdash^{*} ([y'],uv)$ (by a series of
reductions), and there exists a path from Top[y'] that spells out
U, i.e.

$$([y'],uv) \vdash^{+} ([y'][\mathrm{Top}[y']:U],v) = ([y'U],v).$$
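The path part of this necessary condition is cheap to test. The sketch below is an illustration under stated assumptions, not the repair strategy of Chapter 5: it uses a hand-built CFSM for a hypothetical toy grammar S -> E $, E -> ( E ), E -> i, it assumes each candidate stack has already been reduced to its [y'] form, and the names `spells` and `plausible_stacks` are invented for the example.

```python
# Hypothetical toy CFSM (read transitions only), assumed for illustration:
# grammar  S -> E $,  E -> ( E ),  E -> i
SIGMA = {
    (0, "E"): 1, (0, "("): 2, (0, "i"): 3,
    (2, "E"): 4, (2, "("): 2, (2, "i"): 3,
    (4, ")"): 5,
}


def spells(q, U):
    """True iff some CFSM path starting at state q spells out U."""
    for symbol in U:
        q = SIGMA.get((q, symbol))
        if q is None:
            return False
    return True


def plausible_stacks(candidates, U):
    """Keep only the candidate stacks [y'] whose top state can spell U.

    Passing this filter is necessary, not sufficient, for the candidate
    to lead LRPA to accept the remaining input.
    """
    return [Z for Z in candidates if spells(Z[-1], U)]
```

With forward context U = `E )`, the candidate stacks `[0]` and `[0, 2, 4]` are rejected without running the parser, while `[0, 2]` and `[0, 2, 2]` (stacks ending in the state accessed by "(") survive for further evaluation.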

We  shall  now  show  that  the  U returned  by  FMA 
satisfies  the  WVF  property.  In  Lemma  1 we  explore  the 
nature  of  the  state  sets  manipulated  by  FMA.  We  use  this 
lemma  to  prove  Theorem  1,  which  establishes  the  WVF 
property  as  a corollary.  Theorem  2 gives  us  the  additional 
result  that  FMA  in  some  sense  tries  as  hard  as  it  can 
by  consuming  the  longest  possible  forward  text.  Theorem  2 
is  not  essential  to  our  error  recovery  techniques  but 
reassures  us  that  the  techniques  perform  as  well  as  they 
can. 

Lemma  1 captures  the  fact  that  if  LRPA  starting 
with  any  left  context  on  its  stack  makes  the  same  series 
of  moves  as  FMA  does  in  parsing  string  uv,  then  FMA 
has  kept  track  of  LRPA's  state  stack  in  its  state  sets. 

Lemma 1. Suppose
FMA: $(?,uv) \vdash_{M_1} \ldots \vdash_{M_r} (?\,Q_1 Q_2 \ldots Q_m, v)$.
If $([y'],uv) \vdash_{M_1} \ldots \vdash_{M_r} (Z,v)$, then
$Z = [y']\, q_1 q_2 \ldots q_m$, where $q_i \in Q_i$,
$1 \le i \le m$.

Proof. By induction on r. For r = 1: $M_1$ = read by step Readh of
FMA, and FMA has stack $?\,Q_1$. LRPA, after making move $M_1$, has
stack $[y']\, q_1$, where $q_1$ = SIGMA(Top[y'], $u_1$). Now
$Q_1 = \{q' \mid q \xrightarrow{u_1} q' \text{ and } q \in K\}$ by
Readh, hence $q_1 \in Q_1$.

Assume the hypothesis true for r = k; thus FMA has halted with
stack $?\,Q_1 Q_2 \ldots Q_m$, and LRPA has stack
$[y']\, q_1 q_2 \ldots q_m$. Consider move $M_{k+1}$.

(1) $M_{k+1}$ = read; let the symbol to be read be s. Then LRPA
    pushes state $q_{m+1}$ = SIGMA($q_m$,s) by case {read} of the
    parsing algorithm. FMA pushes state set
    $Q_{m+1} = \{q' \mid q \xrightarrow{s} q' \text{ and } q \in Q_m\}$.
    But since $q_m \in Q_m$, $q_{m+1} \in Q_{m+1}$.

(2) $M_{k+1} = A \to w$. FMA pops |w| state sets, leaving stack
    $?\,Q_1 Q_2 \ldots Q_{m-|w|}$ where $m - |w| \ge 0$ (since there
    are at least |w| state sets above ? on the stack). It then
    pushes
    $Q'_{m-|w|+1} = \{q' \mid q \xrightarrow{A} q' \text{ and } q \in Q_{m-|w|}\}$.
    LRPA pushes state $q'_{m-|w|+1}$ = SIGMA($q_{m-|w|}$,A) on the
    stack. Since $q_{m-|w|} \in Q_{m-|w|}$ by the inductive
    hypothesis, $q'_{m-|w|+1} \in Q'_{m-|w|+1}$.

(3) $M_{k+1}$ = accept; the stacks remain the same for both FMA and
    LRPA.


Theorem 1. Suppose
FMA: $(?,uv) \vdash_{M_1} \ldots \vdash_{M_r} ([?:U],v)$.
Let h = head(v). Then for every y and Z such that
$([y],uv) \vdash^{*} (Z,v)$ and PD(Top Z,h) is either {read} or
{accept}, there exists y' such that
$([y],uv) \vdash^{*} ([y'],uv) \vdash^{+} ([y'U],v)$.

Proof. Choose some y and Z such that $([y],uv) \vdash^{*} (Z,v)$
and PD(Top Z,h) is either {read} or {accept}. We let [y'] be such
that ([y],uv) reduces to ([y'],uv). Thus,
$([y],uv) \vdash^{*} ([y'],uv)$ and the first move LRPA takes out
of configuration ([y'],uv) is read ($= M_1$).

We now prove by induction on r that LRPA's next r moves from
configuration ([y'],uv) are $M_1 \ldots M_r$.

For r = 1: $M_1$ = read by step Readh of FMA. We know that LRPA
must read as its first move from configuration ([y'],uv), by our
definition of y'. Now let the theorem hold for r = k. By Lemma 1,
FMA's stack after move $M_k$ is

   $?\,Q_1 Q_2 \ldots Q_m$

and LRPA's stack after move $M_k$ is

   $[y']\, q_1 q_2 \ldots q_m$

where $q_i \in Q_i$, $1 \le i \le m$. Let the next symbol in the
input be s (s is either in u or is the first symbol of v). FMA now
makes move $M_{k+1}$. Consider LRPA's possible actions:

(1) It makes no move at all.
    If s is in u, then this case is impossible since
    $([y],uv) \vdash^{*} (Z,v)$. If s = h, then if $s \ne \bot$,
    then LRPA must be able to move since PD(Top Z,h) = {read} or
    {accept} implies that LRPA eventually accepts or reads h; if
    $s = \bot$, then the only way LRPA cannot move is if its
    previous move was accept; but then FMA's previous move (by
    induction) would have been accept, and it cannot then make move
    $M_{k+1}$.

(2) It makes move $M'_{k+1} \ne M_{k+1}$.
    Then $\{M'_{k+1}\} = PD(q_m,s)$. But then
    $\bigcup_{q \in Q_m} PD(q,s)$ would contain both $M_{k+1}$ and
    $M'_{k+1}$, since $q_m \in Q_m$. Hence by case "otherwise" of
    FMA, FMA would not make move $M_{k+1}$. This contradicts the
    fact that FMA makes move $M_{k+1}$.

Thus we have shown that the next r moves LRPA makes from
configuration ([y'],uv) are $M_1 \ldots M_r$. But by Lemma 1,

   FMA: $(?,uv) \vdash_{M_1} \ldots \vdash_{M_r} (?\,Q_1 Q_2 \ldots Q_m, v)$

and

   $([y'],uv) \vdash_{M_1} \ldots \vdash_{M_r} ([y']\, q_1 q_2 \ldots q_m, v)$

where $q_i \in Q_i$, $1 \le i \le m$. Since
$?\,Q_1 Q_2 \ldots Q_m = [?:U]$,
$[y']\, q_1 q_2 \ldots q_m = [y'][\mathrm{Top}[y']:U] = [y'U]$.

Corollary. If ([y],uv) |-* accept, then there exists
y' such that ([y],uv) |-* ([y'],uv) |-* ([y'U],v) (the
WVF property for U).

Proof. If ([y],uv) |-* accept, then there exists
Z such that ([y],uv) |-* (Z,v) and PD(Top Z,head(v)) =
{read} or {accept}. The corollary now follows.

The  next  theorem  is  not  essential  to  the  correct 
fragment  property,  but  reassures  us  that  FMA  goes  as 
far  as  it  can  without  making  a decision  based  on  context 
to  the  left  of  the  ? state  set. 

Theorem 2. Consider suffix uv of a sentence. If
there exists a sequence of moves M_1 ... M_r (r >= 1) such
that

(i)   M_1 = read,

(ii)  there exists a valid prefix y such that

          ([y],uv) |-_{M_1} ... |-_{M_r} ([yU],v)

      and LRPA never pops any of [y] from the
      state stack,

(iii) there do not exist valid prefixes y, y'
      and k < r such that

          ([y],uv)  |-_{M_1} ... |-_{M_k} (Z,R) |-_{M_{k+1}}  (Z',R')
          ([y'],uv) |-_{M_1} ... |-_{M_k} (Y,R) |-_{M'_{k+1}} (Y',R'')

      and M_{k+1} ≠ M'_{k+1},

then FMA: (?,uv) |-_{M_1} ... |-_{M_r} ([?:U],v).

Proof. By induction on r. For r = 1: FMA makes
move M_1 = read by step Readh. Let the theorem hold for
r = k, and let y be the valid prefix of hypothesis (ii).
By Lemma 1,

    FMA: (?,uv) |-_{M_1} ... |-_{M_k} (? Q_1 Q_2 ... Q_m, R)

and

    ([y],uv) |-_{M_1} ... |-_{M_k} ([y] q_1 q_2 ... q_m, R)

where q_i ∈ Q_i. Let the next symbol of input be s (s
is either in u or is the first symbol of v). We show
that FMA makes move M_{k+1}. Consider the possible
actions of FMA.

(1)  M_{k+1} is A -> w, but FMA cannot make that
     move because there are fewer than |w| state
     sets following ? on the stack. This contradicts
     hypothesis (ii): LRPA would then have
     to pop some of [y] from the state stack.

(2)  FMA makes some move M'_{k+1} ≠ M_{k+1}. But by
     Lemma 1, M_{k+1} is in the union of PD(q,s) over
     all q ∈ Q_m, and thus FMA has a choice of at
     least two moves to make. Thus FMA cannot make
     move M'_{k+1}.

(3)  FMA halts due to another error, i.e. the union
     of PD(q,s) over all q ∈ Q_m is {}. This cannot
     occur, since by Lemma 1 M_{k+1} is in the union
     of PD(q,s) over all q ∈ Q_m.


(4)  FMA halts in case "otherwise" because it has a
     choice between two or more moves (one of them
     M_{k+1}). Let one of the moves, different from
     M_{k+1}, be M'_{k+1}. Then there is some path
     q_0, q_1, ..., q_m such that q_i ∈ Q_i, 1 <= i <= m,
     PD(q_m,s) = {M'_{k+1}}, and q_0 ∈ ?. Let y'
     access q_0. Then for some Y, Y', and R'',

         ([y'],uv) |-_{M_1} ... |-_{M_k} (Y,R) |-_{M'_{k+1}} (Y',R'').

     This contradicts hypothesis (iii).

We have shown possibilities (1) through (4) to be
contradictory, thus the only possibility left for FMA is
to make move M_{k+1}, and the inductive step is proved.


Theorem  2 is  somewhat  tedious,  but  proves  that  FMA 
simulates  LRPA  in  all  the  (possibly  infinite)  situations 
in which LRPA has already parsed some valid prefix y
that causes LRPA to read head(u). Thus the ? state
set  can  really  be  regarded  as  representing  the  set  of  all 
such  valid  prefixes. 

Parts  1 and  4 of  the  case  analysis  demonstrate  how 
FMA  proceeds  without  any  knowledge  of  left  context.  In 
part  1,  reducing  would  cause  FMA  to  interrogate  context 
to  the  left  of  ? to  determine  what  state  to  go  to  on  non- 
terminal A.  In  part  4,  we  have  2 or  more  choices  for  FMA; 
the  correct  choice  depends  on  left  context.  Parts  2 and  3 
capture  the  fact  that  the  choices  FMA  is  presented  with 
contain  all  choices  that  LRPA  would  ever  consider. 


In summary, if LRPA encounters an error in configuration
(Z,uv), and FMA reads u from uv producing forward
context U, we know that FMA makes as many moves
as possible and U satisfies the WVF property. We can
verify that some proposed replacement [y] of Z
satisfies a necessary condition for ([y],uv) |-* accept
by the following process, which we call CHECK_VALID and
which takes as arguments [y], uv, and U:

CHECK_VALID

    Determine y' such that ([y],uv) reduces to
    ([y'],uv). (Note: there may not be any such y',
    in which case we fail, i.e. [y] is unsatisfactory.)
    Determine that a path [Top[y']:U] exists, where
    U = a_1 ... a_m. This can be accomplished in the
    following fashion:

        Let Stack = [y'].
        for i := 1 to m do
            if Top Stack --a_i--> q exists for some q
            then push SIGMA(Top Stack, a_i) on Stack
            else we fail

    We succeed if the for loop runs to completion.

end of CHECK_VALID
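
The path test at the heart of CHECK_VALID can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CFSM transition function SIGMA is encoded as a hypothetical dictionary from (state, symbol) pairs to states, states are small integers, and the preliminary step of reducing ([y],uv) to ([y'],uv) is assumed to have been done already. The toy transition table is invented for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical encoding of the CFSM transition function SIGMA:
# (state, symbol) -> successor state. Missing keys mean "no transition".
Sigma = Dict[Tuple[int, str], int]


def check_valid_path(stack: List[int],
                     forward_context: List[str],
                     sigma: Sigma) -> Optional[List[int]]:
    """Try to extend `stack` (= [y']) along U = a_1 ... a_m.

    Returns the extended stack [y'U] if every transition exists,
    or None as soon as one is missing (the proposed repair fails).
    """
    result = list(stack)                  # do not disturb the caller's stack
    for symbol in forward_context:
        successor = sigma.get((result[-1], symbol))
        if successor is None:
            return None                   # "else we fail"
        result.append(successor)          # push SIGMA(Top Stack, a_i)
    return result                         # the for loop ran to completion


# Invented toy transitions, for illustration only.
TOY_SIGMA: Sigma = {(0, "if"): 1, (1, "Bexp"): 2, (2, "then"): 3, (3, "Stmt"): 4}

if __name__ == "__main__":
    print(check_valid_path([0, 1, 2], ["then", "Stmt"], TOY_SIGMA))
    print(check_valid_path([0], ["then"], TOY_SIGMA))
```

The cost is one table lookup per symbol of U, which is what makes the test cheap enough to apply to many candidate repairs.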

CHECK_VALID  is  a simple,  efficient  test  to  check  the 
viability  of  a proposed  stack  repair.  The  essential  tactic 
that  guarantees  this  result  is  that  forward  moves  never 
proceed  after  FMA  encounters  an  inadequacy,  a reduction 
over  ?,  or  another  error.  Making  some  arbitrary  choice 
between  the  alternatives  in  an  inadequate  transition  in  an 
attempt  to  continue  parsing  is  a serious  mistake;  it  makes 
an  unwarranted  assumption  about  the  context  to  the  left  of 
the  error  point.  The  assumption  has  no  foundation  and  is 
just  a guess  that  may  be  wrong. 

The WVF property of a forward context U gives us
the CHECK_VALID procedure, but there is still another
property of U that can aid error repair. If uv is
the suffix of some sentence, and FMA: (?,uv) |-* ([?:U],v),
then FMA cannot halt in case {} (the error step): the
possible set of moves PD will never be empty. The moves
in PD can give us information relating to the class of
valid prefixes y such that S ->* yuv. We elaborate on
the use of this information in the subsequent chapter, but
prove a property about it here. Theorem 3 states the
property and needs Lemma 2 for its proof.

Lemma 2. Let FMA: (?,uv) |-* ([?:U],v) = (? Q_1 ...
Q_m, v). For any path [y'U] in the CFSM where p =
Top[y'], [Top[y']:U] = p q_1 ... q_m and q_i ∈ Q_i,
1 <= i <= m.

Proof. Let p = Top[y'] and U = a_1 ... a_m.
p ∈ ?, hence by step Readh or case {A -> w} of FMA,

    q_1 = SIGMA(p,a_1) ∈ Q_1 = {q' | q --a_1--> q' and q ∈ ?}.

By a simple induction on m, we have the result.

Theorem 3. Let uv be a suffix of a sentence,
FMA: (?,uv) |-* ([?:U],v), and let PD be the union of
PD(q,head(v)) over all q ∈ Top[?:U]. Then

(1)  |PD| >= 1

(2)  For every y', Z', v', M,
     ([y'U],v) |-_M (Z',v') implies that M ∈ PD.

Proof. If PD were empty, then there could be no
y such that S ->* yuv, and hence uv would not be a
suffix of a sentence. Hence conclusion (1). Consider now
[y'U]. Top[y'U] ∈ Top[?:U] by Lemma 2, hence
PD(Top[y'U],head(v)) ⊆ PD. Hence conclusion (2).


Thus if LRPA halts in some configuration (Z,uv),
and uv is a suffix of a sentence, applying FMA to uv
yields a set of moves PD such that if we propose some
substitution [y] for Z, there must exist some y'
such that ([y],uv) |-* ([y'U],v) |-_M (Z',v') and M ∈ PD.
Suppose, for example, that PD = {A -> w} and that
|w| > |U|. Then M = A -> w, and some suffix y'' of
y' must be such that y''U = w. Hence we know something
explicit about y'. We delay application of this until
the next chapter. We call the property guaranteed by
Theorem 3 the "next move" property.
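
As a small illustration of the set PD in Theorem 3, the following Python sketch forms the union of the per-state parsing decisions over the state set on top of FMA's stack. The table layout (a dictionary keyed by (state, lookahead)) and the move names are assumptions made for the example, not part of the algorithm as stated.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical parsing-decision table for single CFSM states:
# (state, lookahead) -> set of moves, each move named by a string.
PDTable = Dict[Tuple[int, str], Set[str]]


def next_moves(top_state_set: Iterable[int],
               lookahead: str,
               pd_table: PDTable) -> Set[str]:
    """Union of PD(q, lookahead) over every state q in the top state set."""
    moves: Set[str] = set()
    for q in top_state_set:
        moves |= pd_table.get((q, lookahead), set())
    return moves


# Invented toy table: state 3 reads "else", state 5 reduces on "else".
TOY_PD: PDTable = {
    (3, "else"): {"read"},
    (5, "else"): {"Stmt -> Id := Exp"},
}

if __name__ == "__main__":
    print(next_moves({3, 5}, "else", TOY_PD))   # two candidate moves
    print(next_moves({3}, ";", TOY_PD))         # empty: not a sentence suffix
```

By the theorem, an empty result can only arise when the input handed to FMA was not a suffix of any sentence.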

In  summary,  we  have  shown  the  following  three 
properties  to  hold  of  FMA  when  applied  to  a sentence 
suffix:  the  returned  forward  context  is  a WVF;  it  parses 

ahead  as  far  as  possible;  and  it  halts  with  a non-empty 
set  of  moves,  one  of  which  must  be  taken  next.  We  have 
seen  how  the  first  property  yields  an  efficient  algorithm 
for  validating  proposed  error  repairs,  and  have  hinted  at 
the  value  of  the  next  move  property.  In  the  next  chapter 
we  learn  how  to  use  FMA  to  gather  forward  context  in 
particular  error  situations  and  how  to  use  the  next  move 
property  as  an  aid  to  error  repair. 

We  emphasize  finally  that  the  results  of  this  chapter 
do  not  define  any  error  recovery  strategy,  but  merely 
provide  useful  tools  that  any  strategy  may  use. 


Chapter  5. 

USING  FMA  IN  AN  ERROR  REPAIR  ALGORITHM 




In  this  section  we  concern  ourselves  with  determining 
the  best  way  to  use  FMA  to  gather  forward  context  in 
conjunction  with  some  error  repair  strategy.  As  mentioned 
in  the  introduction,  we  restrict  ourselves  at  first  to 
considering  only  a single  occurrence  of  one  of  three  types 
of  errors:  insertion,  deletion,  and  replacement  of  a 

single  terminal  symbol.  We  note  that  the  errors  in  the 
sample  test  program  of  Graham  and  Rhodes  [G&R  75]  are  all 
of this type. We call this assumption the "simple error
assumption."

We can describe these three situations in the following
manner (x, z ∈ T*, and s, t ∈ T):

    Insertion error:    S ->* xz    but not  S ->* xtz.
    Deletion error:     S ->* xtz   but not  S ->* xz.
    Replacement error:  S ->* xsz   but not  S ->* xtz.
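
To make the three error types concrete, here is a minimal Python sketch that applies each one to a token list. The function names and the sample tokens are illustrative assumptions; the position index plays the role of the boundary between x and z.

```python
from typing import List


def insert_symbol(sentence: List[str], index: int, t: str) -> List[str]:
    """Insertion error: the sentence xz becomes xtz."""
    return sentence[:index] + [t] + sentence[index:]


def delete_symbol(sentence: List[str], index: int) -> List[str]:
    """Deletion error: the sentence xtz becomes xz."""
    return sentence[:index] + sentence[index + 1:]


def replace_symbol(sentence: List[str], index: int, t: str) -> List[str]:
    """Replacement error: the sentence xsz becomes xtz."""
    return sentence[:index] + [t] + sentence[index + 1:]


if __name__ == "__main__":
    good = ["if", "Bexp", "then", "Stmt"]
    print(delete_symbol(good, 0))            # the "if" is lost
    print(insert_symbol(good, 2, "do"))      # a stray "do" appears
    print(replace_symbol(good, 2, "do"))     # "then" is mistyped as "do"
```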

We view an error recovery algorithm as being composed
of two phases: (1) the gathering of forward context, and
(2) the application of an error repair strategy (which uses
the forward context). Given the simple error assumption,
we first investigate how to use FMA to acquire forward
context. Then we show how an error recovery strategy might
use the forward context, leaving the error recovery
strategy itself unspecified, but providing general hints
as to how it might work.

Gathering  forward  context . We  investigate  the 
different  situations  LRPA  encounters  when  it  detects 
an  error,  determine  how  best  to  gather  forward  context  in 
each  case,  and  develop  an  overall  strategy  based  upon  the 
case  analysis. 

In the insertion case, LRPA may detect the error
before or after reading t, i.e. it may halt in
configuration (Z,tz) or in (Z,z') (z' is a suffix of z).
In the latter case, the inserted symbol t has been
absorbed into the left context Z. The possibilities
are the same for the replacement case. In the deletion
case, LRPA halts in (Z,z') (again, z' is a suffix
of z). We consider halting configurations (Z,tz) and
(Z,z') separately.

We  distinguish  between  the  concepts  error  and  error 
symptom.  When  LRPA  encounters  an  unexpected  symbol 
(case {} of the LR algorithm), we say that it detects
(the existence of) an error and that the symptom of the
error is that LRPA fails on the symbol. It is the goal
of  error  recovery  to  eliminate  the  symptom. 


Case (Z,tz). (We have an insertion or replacement
error.)

Where Graham and Rhodes and Druseikis and Ripley
resume the parse by immediately performing a forward move,
we do not. Since the symbol t heads the input, such an
action would necessarily start us off in the wrong context.
We instead delete the t from the input, and then invoke
FMA. Since our simple error assumption guarantees that
z is a sentence suffix, the forward context developed
both is a WVF and satisfies the next move property.

Case (Z,z'). Either LRPA has absorbed t on its
stack (in the replacement/insertion case), or a deletion
error occurred. LRPA halts in configuration (Z,z').
Since t has been absorbed onto the stack, z' is
a sentence suffix and we merely submit z' to FMA.

Combining  the  case  analyses . Since  we  cannot  know 
a priori  which  case  is  the  actual  circumstance  when  an 
error  is  detected,  we  must  combine  case  strategies  into 
one. This combination works as follows: Not knowing
whether the unexpected symbol is in error or not, we
always initially skip over it, then perform the forward
move. By the assumption that the program is mutilated
by only a single error, this forward context is derived
from a sentence suffix. Then we determine if the
unexpected symbol can be attached to the front of the
already developed forward context. If it can, it is most
likely not in error (we are thus in case (Z,z')); if it
cannot, then with an exception (case "otherwise" of RCA
below) we are most likely in case (Z,tz). Therefore,
assume LRPA halts in configuration (Z,suv), where
s ∈ T, u, v ∈ T+ (uv is a sentence suffix). Compute
the forward context by the following algorithm:

Right Context Algorithm (RCA).

    Determine U such that FMA: (?,uv) |-* ([?:U],v).

    Then, try to attach s to the front of U as
    follows:

    Determine s' such that (?,suv) |-* ([?:s],uv)
    |-* ([?:s'],uv) where FMA has made as many
    moves as it can without reading head(u).

    Let PD = the union of PD(q,head(u)) over all
    q ∈ Top[?:s'].

    do case PD:

    case {read}:
        Determine if path [?:s'U] exists;
        suv is a sentence suffix only if the path
        exists. If it does not, then discard s;
        we are in case (Z,tz), with s = t. If
        it does, then try to continue the forward
        move farther from configuration ([?:s'U],v),
        i.e. determine U' such that
        ([?:s'U],v) |-* ([?:U'],v').
        It is likely (but not certain) that we
        are in case (Z,z').

    case {}:
        We may conclude that s is a bad symbol,
        and discard it; we are in case (Z,tz),
        with s = t.

    case otherwise (i.e. |PD| > 1):
        We cannot conclude anything definite about
        s. We then end up with two forward
        contexts, [s'] and [U].

end RCA
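
The three-way case analysis that closes RCA can be summarized by a small Python sketch. It assumes that the set PD and the result of the path test for [?:s'U] have already been computed elsewhere; the move name "read" and the verdict strings are hypothetical labels chosen for the example.

```python
from typing import Set

# Hypothetical verdict labels for the outcome of RCA's case analysis.
DISCARD_S = "case (Z,tz): discard s"
KEEP_S = "probably case (Z,z'): keep s and extend the forward move"
UNDECIDED = "undecided: keep both forward contexts [s'] and [U]"


def rca_verdict(pd: Set[str], path_exists: bool) -> str:
    """Decide what RCA concludes about the unexpected symbol s.

    pd          -- union of PD(q, head(u)) over q in Top[?:s']
    path_exists -- whether the path [?:s'U] exists (only consulted
                   when pd is exactly {read})
    """
    if pd == {"read"}:
        return KEEP_S if path_exists else DISCARD_S
    if not pd:                      # case {}: s is a bad symbol
        return DISCARD_S
    return UNDECIDED                # case otherwise


if __name__ == "__main__":
    print(rca_verdict({"read"}, True))
    print(rca_verdict({"read"}, False))
    print(rca_verdict(set(), False))
    print(rca_verdict({"read", "Exp -> Id"}, False))
```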

RCA sometimes can tell us whether we are in case
(Z,tz) or (Z,z'), and in most situations produces a
single forward context with which to validate potential
error repairs. The single exception is case "otherwise"
where we have two. But in all situations we have a forward
context (U or U') that is a WVF and satisfies the
next move property, since it is the result of applying
FMA to a suffix of some sentence.

This  completes  the  discussion  of  how  to  gather  right 
context. 

Error repair suggestions. We do not intend to
provide a complete error repair strategy. Rather, we
offer only some suggestions and indicate how the forward
context might be used to aid the strategy.

In case (Z,tz), the obvious thing to try is the
deletion of the unexpected symbol and the replacement of
it with each other terminal in turn; the former is achieved
by applying CHECK_VALID to the existing stack Z and the
forward context U of RCA, the latter by applying it
to Z modified by appending to it each possible terminal.
Given the simple error assumption we must be able to hit
upon the proper correction.

In case (Z,z'), since t (or its absence) is
buried in the stack Z, complex stack modifications may
be required to repair the error. We illustrate this
with the following, where a deletion causes LRPA to
erroneously reduce the stack into a "higher context."
Suppose that the text "if I=K then I=J else I=L"
were altered to "I=K then I=J else I=L". Rather
than detecting the deletion of "if", LRPA assumes it
is parsing an assignment statement, and halts when the
unexpected "then" is encountered, in configuration
([Left_part = Exp], then I=J else I=L ⊥). Were
the "if" not omitted, upon encountering the "then"
LRPA would be in configuration ([if Bexp], then I=J
else I=L ⊥). We need a stack modification to transform
[Left_part = Exp] to [if Bexp], a tall order. This
example illustrates how erroneous reductions greatly
complicate error repair.

(The  occurrence  of  such  erroneous  reductions  is  the 
reason  that  we  are  not  convinced  of  the  efficacy  of  the 
backward  move  espoused  by  Graham  and  Rhodes  and  Druseikis 
and Ripley. The backward move seeks to cause just such
reductions.)

To aid the invention of stack repairs, we suggest the
use of the next move property, which says that after FMA
halts, we have a set PD of moves, one of which must
eventually be made. (Although if erroneous reductions
have occurred, the "easiest" repair may not include any
of those moves.) If FMA terminates in case {A -> w}
with an attempted reduction over ?, then (given our
simple error assumption) we know what phrase was intended
and what move we must make (viz., the reduction). In the
previous example, suppose the forward context U' computed
in case {read} of RCA was such that (?, then I=J else
I=L ⊥) |-* ([then Stmt else Stmt], ⊥), with FMA halting
in case {If_stmt -> if Bexp then Stmt else Stmt}; we then
know that, in this example, [Left_part = Exp] must be
modified to "if Bexp". After thus modifying the stack we
can effect the reduction and resume normal parsing. As an
approximation to this we could simply search for some state
preceding the ? that reads the nonterminal If_stmt;
after finding such a state q we delete all states on top
of it and push SIGMA(q,If_stmt) on the stack. In the
example, we would delete the top three states, leaving
only the start state (= q) on the stack, and resume
parsing in the new configuration ([If_stmt], ⊥). While
not correcting the actual error, we in effect modify
LRPA's stack so that it behaves as if the error were
corrected. We call this technique SF for "stack forcing",
because it tries to "force" a production to fit the stack.
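
A minimal Python sketch of stack forcing follows. It assumes the same hypothetical dictionary encoding of SIGMA as a map from (state, symbol) to state, and it searches from the top of the stack downward for the nearest state that reads the nonterminal; the text does not fix a search order, so that choice is an assumption. State numbers in the toy example are invented.

```python
from typing import Dict, List, Optional, Tuple

Sigma = Dict[Tuple[int, str], int]   # hypothetical encoding of SIGMA


def stack_force(stack: List[int],
                nonterminal: str,
                sigma: Sigma) -> Optional[List[int]]:
    """Force `nonterminal` onto LRPA's stack (the part preceding ?).

    Finds a state q on the stack with a transition on `nonterminal`,
    deletes every state above q, and pushes SIGMA(q, nonterminal).
    Returns None when no state on the stack reads the nonterminal.
    """
    for depth in range(len(stack) - 1, -1, -1):      # top of stack first
        target = sigma.get((stack[depth], nonterminal))
        if target is not None:
            return stack[:depth + 1] + [target]
    return None


if __name__ == "__main__":
    # Invented states: 0 = start, then 5, 6, 7 for "Left_part = Exp".
    toy_sigma: Sigma = {(0, "If_stmt"): 9}
    print(stack_force([0, 5, 6, 7], "If_stmt", toy_sigma))
```

In the toy run the three states above the start state are dropped and the successor on If_stmt is pushed, mirroring the example in the text.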

If FMA terminates in case "otherwise", we are given
a choice of one or more productions to try to use or
possibly a read transition. Only some of these choices
are of practical use in improving a repair strategy, as
follows. Classify the productions as either "long" or
"short" depending upon whether reduction by them would
consume the ? state. Long productions give us an
indication of what the stack preceding the ? should be; we
can submit each of these to SF, in the hope that at least
one can be forced to fit. Short productions can also give
us some information with regard to this portion of the
stack; this information is not explicit but is buried
within the CFSM transitions and the items in the states.
A way to extract it is to perform the reduction and continue
parsing, awaiting a long production that can tell us
something explicit. We believe such an approach may be too
cumbersome to be useful.

If a read transition is among our choices, let the
items associated with the read transition be A_i -> x_i . t y_i
where t is the symbol to be read. If we choose to read
t, then one of the strings x_i must match the top of the
stack, and we can verify this before reading t. There
are both long and short such strings, and the long
strings can give us information about the stack preceding
?. Unfortunately, to use the read items we must keep the
actual items around during parse time, a requirement that
is uneconomical in space.

Among  all  of  the  possibilities  presented  when  FMA 
halts  with  an  inadequate  transition,  the  next  move  property 
tells  us  that  one  must  be  the  "correct"  choice.  As  we  have 
noted,  long  productions  may  be  of  immediate  use,  but  we  do 
not  see  obvious  or  simple  ways  of  using  the  other  choices. 
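
The long/short distinction drawn above reduces to a length comparison, sketched here in Python. The sketch assumes that a production A -> w is "long" exactly when |w| exceeds the number of state sets lying above the ? on FMA's stack, which is the condition under which the reduction would consume the ?. The function name and sample numbers are illustrative.

```python
def classify_production(rhs_length: int, sets_above_marker: int) -> str:
    """Return "long" if reducing would pop the ? state set, else "short".

    rhs_length        -- |w| for the production A -> w
    sets_above_marker -- number of state sets following ? on FMA's stack
    """
    return "long" if rhs_length > sets_above_marker else "short"


if __name__ == "__main__":
    # If_stmt -> if Bexp then Stmt else Stmt has |w| = 6; with only
    # "then Stmt else Stmt" (4 entries) above ?, the reduction is long.
    print(classify_production(6, 4))
    # A three-symbol production fits entirely above ?, so it is short.
    print(classify_production(3, 4))
```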

Summary. Thus, an error recovery algorithm
incorporates (1) the gathering of right context, which RCA
outlines how to do, and (2) the application of an error
repair strategy, which we have not specified, but for
which we have made some suggestions. We further suggest
that if the strategy fails, then we apply the
algorithm recursively, again gathering right context and
attempting error repair in the hope that some later
correction can repair more than one error. The recursive
approach ensures that we never stop trying to parse the
input, therefore preventing the algorithm from totally
failing when we cannot correct some error, or when there
are multiple errors in the input.

Chapter 6.

MAKING  FMA  PRACTICAL 

In  this  section  we  show  how  to  convert  the  state  sets 
manipulated  by  FMA  into  other  states  and  precompute  the 
transitions  between  these  new  states. 

We have described FMA as an algorithm that
manipulates sets of states in an attempt to keep track of
many state stacks at once. FMA computes state sets
dynamically by referring to the CFSM; e.g., cases {read} and
{A -> w} compute the next state set from the previous state
set Q by calculating {q' | q --s--> q' and q ∈ Q} (s is
h or A). There is no reason why we cannot precompute
these state sets and the transitions between them; this
gives rise to an error recovery FSM (ERFSM). For a grammar
G, let (K, START, SIGMA, V', F) be its CFSM. The ERFSM
of G is the 5-tuple (K', ?, ERSIGMA, V', F') where ? = K
and F' = {{f} | f ∈ F} = {{{}}}. K' is computed as
follows: Begin with K' = {?}. Repeatedly add to K'
the successors of state sets in K', where if s ∈ V',
the s-successor of Q ∈ K' is {q' | q --s--> q' and q ∈ Q}.
Thus for Q, Q' ∈ K', s ∈ V', ERSIGMA(Q,s) = Q' iff Q'
is the s-successor of Q in the ERFSM. We can in a
simple way specify the look-ahead function LA(Q,A -> w)
for elements of K': LA(Q,A -> w) is the union of
LA(q,A -> w) over all q ∈ Q. The parsing decision function
can now be computed as for the CFSM; due to the
construction of the ERFSM, we can show that for Q ∈ K',
s ∈ V', PD(Q,s) is the union of PD(q,s) over all q ∈ Q.

Figure  3 shows  the  ERFSM  for  the  CFSM  of  Figure  2. 
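
The closure that yields K' and ERSIGMA can be sketched as an ordinary worklist computation, much like a subset construction started from the full state set. The Python below is an illustration under assumptions: SIGMA is a dictionary from (state, symbol) to state, empty successor sets are simply not recorded, and the three-state CFSM is a toy.

```python
from collections import deque
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Sigma = Dict[Tuple[int, str], int]                       # hypothetical CFSM encoding
StateSet = FrozenSet[int]


def build_erfsm(states: Iterable[int],
                symbols: Iterable[str],
                sigma: Sigma) -> Tuple[Set[StateSet], Dict[Tuple[StateSet, str], StateSet]]:
    """Compute K' and ERSIGMA, starting from ? = K (all CFSM states)."""
    symbols = list(symbols)
    start: StateSet = frozenset(states)                  # the ? state set
    k_prime: Set[StateSet] = {start}
    ersigma: Dict[Tuple[StateSet, str], StateSet] = {}
    worklist = deque([start])
    while worklist:
        current = worklist.popleft()
        for s in symbols:
            # s-successor: {q' | q --s--> q' and q in current}
            successor = frozenset(sigma[(q, s)] for q in current if (q, s) in sigma)
            if not successor:
                continue                                 # assumption: omit empty sets
            ersigma[(current, s)] = successor
            if successor not in k_prime:
                k_prime.add(successor)
                worklist.append(successor)
    return k_prime, ersigma


if __name__ == "__main__":
    toy_sigma: Sigma = {(0, "a"): 1, (1, "b"): 2, (0, "b"): 2}
    k_prime, ersigma = build_erfsm({0, 1, 2}, ["a", "b"], toy_sigma)
    print(sorted(sorted(q) for q in k_prime))
```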

FMA  now  need  not  do  dynamic  state  computation;  we  can 
use  the  ERFSM  and  algorithm  FMA'  below  to  achieve  the 
same  effect: 

FMA'.

    Push?:  Push ? on the stack.

    Readh:  Let h = head of input.
            Push ERSIGMA(?,h) on the stack; read h.

    Parse repeatedly according to the following rules:
    Let h = head of input, Q = state on top of stack.
    Let PD = PD(Q,h).
    do case PD:

    case {read}:    Read h and push ERSIGMA(Q,h).
    case {A -> w}:  Ensure that |w| states reside
                    on the top of the stack following the ?
                    (if not, halt). Pop |w| states and let
                    Q = state now on top of stack.
                    Push ERSIGMA(Q,A) on the stack.
    case {}:        Halt, signalling an error.
    case {accept}:  Halt; we have consumed all
                    but the ⊥.
    case otherwise (i.e. |PD| > 1):  Halt.

end FMA'.
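
A compact Python rendering of FMA' may help in reading the case analysis. It is a sketch under several assumptions: ERFSM states are opaque labels, ERSIGMA and PD are dictionaries, a reduction is represented as the tuple ("reduce", A, |w|), the input ends with "$" standing in for the end marker, and a reduction that would reach the ? simply halts. The tables in the demonstration are hand-written toys, not derived from a real grammar.

```python
from typing import Dict, List, Set, Tuple, Union

Move = Union[str, Tuple[str, str, int]]     # "read", "accept", or ("reduce", A, |w|)


def fma_prime(tokens: List[str],
              start: str,
              ersigma: Dict[Tuple[str, str], str],
              pd: Dict[Tuple[str, str], Set[Move]]) -> Tuple[List[str], int, str]:
    """Run the forward move; return (stack, tokens consumed, halting reason)."""
    stack = [start]                                   # Push?
    first = ersigma.get((start, tokens[0]))           # Readh
    if first is None:
        return stack, 0, "error"
    stack.append(first)
    pos = 1
    while True:
        h, top = tokens[pos], stack[-1]
        moves = pd.get((top, h), set())
        if moves == {"read"}:
            stack.append(ersigma[(top, h)])
            pos += 1
        elif len(moves) == 1 and isinstance(next(iter(moves)), tuple):
            _, lhs, rhs_len = next(iter(moves))
            if len(stack) - 1 < rhs_len:              # would pop the ? itself
                return stack, pos, "reduction over ?"
            del stack[len(stack) - rhs_len:]          # pop |w| states
            stack.append(ersigma[(stack[-1], lhs)])
        elif not moves:
            return stack, pos, "error"
        elif moves == {"accept"}:
            return stack, pos, "accept"
        else:
            return stack, pos, "otherwise"


# Hand-written toy tables (assumptions for illustration only).
TOY_ERSIGMA = {("?", "Id"): "I", ("I", ":="): "A", ("A", "Id"): "I2",
               ("A", "Exp"): "E", ("?", "Stmt"): "S", ("S", ";"): "T"}
TOY_PD: Dict[Tuple[str, str], Set[Move]] = {
    ("I", ":="): {"read"}, ("A", "Id"): {"read"},
    ("I2", ";"): {("reduce", "Exp", 1)}, ("E", ";"): {("reduce", "Stmt", 3)},
    ("S", ";"): {"read"}, ("T", "$"): {("reduce", "Stmt_list", 3)},
}

if __name__ == "__main__":
    print(fma_prime(["Id", ":=", "Id", ";", "$"], "?", TOY_ERSIGMA, TOY_PD))
```

In the toy run the forward move condenses "Id := Id" into Stmt, reads ";", and then halts because the next reduction needs more stack than lies above the ?.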

The fact that FMA and FMA' are equivalent should not
be difficult to see based on the construction of the ERFSM.
Note that FMA' is much like the normal parsing algorithm
in that it manipulates only states.

Now that FMA' manipulates states rather than state
sets, we can suggest a space optimization on the ERFSM.
Suppose for some q ∈ K, {q} ∈ K' (this occurs often;
see Figure 3). If q --s--> q' is a transition of the
CFSM, then {q} --s--> {q'} is a transition of the ERFSM.
Once FMA' pushes a state {q} on its stack, and until
it sometime later pops {q}, it will behave as if it had
pushed state q on its stack. Thus we may "share" state
{q} in K' with state q in K; states in K' having
transitions into {q} can be modified to instead have
the same transitions into q. Such sharing reduces the
storage in the final parser + error recovery package. The
ERFSM may share every state {q} with its corresponding
state q in the CFSM. The following criterion, satisfied
by (but not only by) the singleton states in K', determines
whether an ERFSM state can be shared with a CFSM state:

(State sharing criterion) For any q ∈ K, Q ∈ K', Q may
share with q iff for every y ∈ V'*, if q gets to p
by y and Q gets to P by y then PD(p,h) = PD(P,h)
for every h ∈ V. Phrased differently, if y describes
a path from q to p in the CFSM and a path from Q
to P in the ERFSM, the parsing decisions that P and
p make must be the same.

States in K' other than singleton sets may also satisfy
this criterion. To see this, let
t_0 = {A -> t.} and t_1 = {A -> t., B -> t.}, both members
of K. Let {t_0,t_1} ∈ K'. Note that t_0 ∪ t_1 = t_1.
Then if PD(t_1,h) = PD({t_0,t_1},h) for every h ∈ V,
{t_0,t_1} may be shared with t_1. This is the same as
requiring that the look-ahead for production A -> t in
state t_0 be a subset of the look-ahead for production
A -> t in state t_1. Non-singleton states that can be
shared occur in practice, but they are non-trivial to
check for. Singleton states are very easy to check for
when generating the ERFSM, and the LALR generator at
UC Santa Cruz does this. Figure 4 shows the shared ERFSM
for the ERFSM of Figure 3.

For  grammars  we  have  run,  which  include  a grammar 
for  PASCAL,  from  60  to  80  percent  of  the  ERFSM  states 
may  be  shared,  resulting  in  a substantial  savings  in  space. 
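
Detecting the singleton case of state sharing is a one-line test per ERFSM state, as the following Python sketch shows. It assumes ERFSM states are represented as frozensets of CFSM state numbers; the function names and the toy state sets are illustrative, and the non-singleton criterion is deliberately not attempted here.

```python
from typing import Dict, FrozenSet, Iterable

StateSet = FrozenSet[int]


def share_singletons(erfsm_states: Iterable[StateSet]) -> Dict[StateSet, int]:
    """Map every singleton ERFSM state {q} to the CFSM state q it can share with."""
    return {Q: next(iter(Q)) for Q in erfsm_states if len(Q) == 1}


def shared_fraction(erfsm_states: Iterable[StateSet]) -> float:
    """Fraction of ERFSM states that are singletons (hence shareable)."""
    all_states = list(erfsm_states)
    if not all_states:
        return 0.0
    return len(share_singletons(all_states)) / len(all_states)


if __name__ == "__main__":
    toy_k_prime = [frozenset({0, 1, 2}), frozenset({1}), frozenset({2})]
    print(share_singletons(toy_k_prime))
    print(shared_fraction(toy_k_prime))
```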

FMA' resembles the technique of Druseikis and Ripley.
However, they (1) do not have a unique start state with
which to begin the forward move, (2) do not consider states
in the ERFSM to be sets of states in K but rather
actual item sets (our research independently started out
that way but study revealed that the item sets were unions
of item sets of states in K, so that ERFSM states were
conceptually better modelled and computed as sets of states
in K), (3) handle the problem only for SLR grammars
(they claim that the generalization to LALR is
straightforward, but their paper does not indicate the greater
difficulty in computing LALR look-ahead sets for the
ERFSM; they merely attach SLR look-ahead sets to every
production in the ERFSM, and SLR look-ahead sets are
computed independently of the state in which the final
item appears). Our technique works in general for LR
parsers of any type, handling SLR as a special case. In
addition, the number of states in our CFSM plus the number
of states in our ERFSM can be up to |V| - 1 fewer than
the number of states needed by Druseikis and Ripley to
implement the parser and error recovery machine; this is
due to the |V| start states needed by their error recovery
machine.

Chapter 7.
CONCLUSION 


We  have  provided  a method  to  do  the  forward  move 
of  Graham  and  Rhodes  for  LR  parsers  in  a practical  and 
efficient  manner.  We  have  shown  that  our  algorithm  FMA 
carries  the  forward  move  along  as  far  as  it  possibly  can 
before  halting,  and  that  the  results  of  it  are  useful  in 
selecting  and  validating  error  repairs.  Given  the  simple 
error  assumption  we  have  described  how  FMA  can  be  used 
to  gather  forward  context,  and  have  indicated  how  an  error 
recovery  strategy  might  employ  the  gathered  context.  At 
UC Santa Cruz an error recovery strategy using forward
context is in development, and so far it has proven
successful in practice.

Further research. We have left unexplored many areas
related to FMA. In particular, some of them are

(1)  How large is the ERFSM in comparison to
     the CFSM?
     Are their sizes linearly related?
     How is this related to the grammar?

(2)  On "the average", how much forward text does
     FMA consume?
     What circumstances permit FMA to consume a
     lot of forward text?
     How are these circumstances related to language
     constructs?

(3)  We define a grammar's "robustness" to be
     proportional to how much forward text FMA consumes
     on "the average".
     Is there an algorithm that indicates weak spots
     in a grammar, i.e. where the grammar is not
     robust?

(4)  What better or other ways may forward context
     be used in error repair?

S        ->  Program ⊥
Program  ->  Stmt .
Stmt     ->  integer Id list ,
         ->  Id := Exp
         ->  for Id := Exp step Exp until Exp do Stmt
         ->  begin Stmt list ; end
         ->  while Exp do Stmt
Exp      ->  Id
         ->  Int
Id       ->  '<IDENTIFER>'
Int      ->  '<INTEGER>'

Figure 1. Grammar for a simple Algol-like language.
'<IDENTIFER>' and '<INTEGER>' represent the generic
classes of identifiers and integers respectively.
"A list B" means a list of A's separated by B's.
Capitalized strings are the only nonterminals.